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BI-CROSS-VALIDATION OF THE SVD AND THE 
NONNEGATIVE MATRIX FACTORIZATION^ 

By Art B. Owen and Patrick O. Perry 

Stanford University 

This article presents a form of bi-cross- validation (BCV) for choos- 
ing the rank in outer product models, especially the singular value de- 
composition (SVD) and the nonnegative matrix factorization (NMF). 
Instead of leaving out a set of rows of the data matrix, we leave out 
a set of rows and a set of columns, and then predict the left out 
entries by low rank operations on the retained data. We prove a self- 
consistency result expressing the prediction error as a residual from a 
low rank approximation. Random matrix theory and some empirical 
results suggest that smaller hold-out sets lead to more over-fitting, 
while larger ones are more prone to under-fitting. In simulated ex- 
amples we find that a method leaving out half the rows and half the 
columns performs well. 

1. Introduction. Many useful methods for handling large data sets begin 
by approximating a matrix X G ]R»^x" by a product LR, where L G M"*^'^ 
and R G M'^^" both have rank k < min(?Ti,n). Such outer product models 
are widely used to get simpler representations of large data matrices, ei- 
ther for regularization or for interpretation. The best known example is the 
truncated singular value decomposition (SVD) [Eckart and Young (1936); 
Hansen (1987)]. Other examples arise by placing constraints on the factors 
L and R. For instance, the nonnegative matrix factorization (NMF) [see 
Lee and Seung (1999)] requires L and R to have elements in [0,oo) and 
the familiar /c-means clustering of rows of X imposes a binary structure on 
L. These and some other examples are described in Lazzeroni and Owen 
(2002). 

For all of these methods, the choice of k is problematic. Consider the SVD. 
The best rank k approximation to X, judging by squared error, is obtained 
by truncating the SVD to k terms yielding X^''^ . Larger values of k provide 
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a better fit to the original X and for k = min(77T,, n) we get X^*^) = X. This 
clearly raises some risk of overfitting. If X is generated from a noisy process, 
then that noise is reasonably expected to affect X^^^ more strongly when k 
is large. 

We would like to cross-validate the rank selection in order to counter 
such overfitting. Hoff (2007) states that the usual practice for choosing k is 
to look for where the last large gap or elbow appears in a plot of singular 
values. Wold (1978) mentions a strategy of adding terms until the residual 
standard error matches the noise level, when the latter is known. Cross- 
validation (CV) is operationally simpler. We do not need to know the noise 
level, and instead of judging which bend in the curve is the most suitable 
knee, we select k to minimize a numerical criterion. 

Several methods have been proposed for cross-validating the SVD. Of 
particular interest is the approach taken by Gabriel (2002). Writing X as 

X = ( "^^'^ ^1,2: n \ 

\X2:m,l X2 : m,2 : n / 

Gabriel fits a truncated SVD of k terms to X2:m,2:ni yielding -^^2- m 2- n' 
then scores the error in predicting Xn by e^ii = Xn — Xi^2: n(^2^L 2: n)'''^2: m,i) 
where denotes the Moore-Penrose pseudo-inverse of the matrix Z. Leav- 
ing out every point in turn, the cross- validated squared error is YllLi Yl^=i{^ifY ^ 
where 1^^^ is defined analogously to e^^ . Gabriel's approach is a kind of bi- 
cross- validation (BCV), as it leaves out a row and column simultaneously. 

Our main goal in this paper is to develop BCV into a tool that is generally 
applicable to outer product approximations, just as CV is for i.i.d. sampling. 
We generalize the BCV to r x s holdouts because in large problems 1x1 
holdouts are unworkable. As a result, there is an {h x ^)-fold BCV that is a 
direct analogue of fc-fold CV. An advantage of CV is that it applies easily 
for different methods and error criteria. We show here how BCV can be 
extended to more general outer product models. We use the NMF to show 
how that works in detail. 

The NMF is often thought of as a bi-clustering of rows and columns, and 
of course /c-means is clearly a clustering method. The BCV we present thus 
provides a form of cross-validation for unsupervised learning. 

Our second goal is to understand BCV. Gabriel (2002) focuses on the 
bi-plot and does not explain or motivate BCV. We show a self-consistency 
property for BCV whereby the residual e^'^) vanishes for some matrices X 
of rank k. This residual can be explained in terms of principal-components 
regressions, and it can be generalized in several ways. 

The outline of this paper is as follows. Section 2 provides some back- 
ground on MacDuffee's theorem and the SVD. Section 3 describes how the 
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rank of an SVD is chosen in the crop science hterature and explains why 
the solutions prior to Gabriel (2002) are unsatisfactory. Then it presents the 
self-consistency lemma mentioned above. Section 4 looks at rank k matri- 
ces that fail to satisfy an assumption of the self-consistency lemma. These 
involve a kind of degeneracy in which the entire signal we seek to detect 
has been held out. Section 5 shows how to extend BCV from the SVD to 
the NMF and similar methods. Section 6 applies some recent results from 
random matrix theory to BCV of the SVD with Gaussian noise. From this 
analysis, we expect small holdouts to be more prone to overfitting and large 
holdouts more prone to underfitting. Section 7 has some numerical examples 
for BCV of the SVD. Section 8 has examples of BCV for the NMF. Section 9 
summarizes the results. 

1.1. Related methods. Many alternatives to cross- validatory choice are 
based on attempts to count the number of parameters in a rank k model. 
Some of these methods and their difficulties are mentioned in Section 3. 

There are a great many classical methods for choosing k in the context of 
principal components analysis. There one usually has a number m ^ oo of 
i.i.d. rows (with variance S) while the number of columns n is fixed. Mar- 
dia, Kent and Bibby (1979), Chapter 9.5, presents a likelihood ratio test 
for whether the last n — k eigenvalues of S are identical for Gaussian data. 
Holmes-Junca (1985) compares cross-validation and the bootstrap. Jolliffe 
(2002), Chapter 6, surveys many methods including techniques based on 
scree plots, broken stick models and the bootstrap. Scree plots can be quite 
unreliable. It has long been known that even when all n eigenvalues of S are 
identical, there will be some slope in the scree plot, and numerical work in 
Jackson (1993) finds that a method based on scree plots gives poor perfor- 
mance in some numerical examples motivated by ecology. His simulations 
also included broken stick models, some bootstraps, some likelihood ratio 
tests and the Kaiser-Guttman rule that retains any principal component 
whose eigenvalue is above average. Some broken stick models worked best. 

Minka (2000) takes a Bayesian model selection approach. He obtains a 
posterior distribution on the data matrix under a Gaussian model, with 
the choice of k corresponding to an assumption that the smallest n — k 
eigenvalues of S are equal. The chosen k is the one that maximizes a 
Laplace approximation to the posterior probability density of the data set. 
The Laplace method gave the best results in simulations. It beat a cross- 
validatory method based on holding out rows of the data matrix. 

Our setting is different from the ones for principal components. We will 
have m and n both tending to infinity together and neither rows nor columns 
need to be i.i.d. 

Hoff (2007) treats the SVD under a model where X is a random isotropic 
low rank matrix plus i.i.d. Gaussian noise. He presents a Gibbs sampling 
approach for that setting. 
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The microarray literature has some similar methods for imputing missing 
values. Troyanskaya et al. (2001) fit an SVD model to the nonmissing values, 
using an EM style iteration, and then the low rank fitted values are used for 
imputation. Oba et al. (2003) present a Bayesian version of SVD imputation, 
using an EM iteration. 

The present setting is different from the imputation problems. We have 
all the values of X and we want to smooth some noise out of it via low rank 
approximation. The retained data set will consist of rectangular submatrices 
of X so that the SVD or NMF or other algorithms can be applied to it 
without requiring missing data methods. 

S. Wold (1978) cross-validates the rank of an SVD model by leaving out a 
scattered set of matrix elements. He advocates splitting the data set into 4 to 
7 groups. In his Figure 1, each such group corresponds to one or more pseudo- 
diagonals of the data matrix, containing elements Xij, where j = i + r, for 
some r. Having left out scattered data, Wold has to solve an imputation 
problem in order to define residuals. He does this via the NIPALS algorithm, 
a form of alternating least squares due to H. Wold (1966). 

2. Background matrix algebra. This section records some facts about 
matrix algebra for later use. 

The Moore-Penrose inverse of a matrix X £ i^"ix" ig denoted by X~^ £ 
]^nxm_ rpj^g monograph of Ben-Israel and Greville (2003) provides a detailed 
presentation of the Moore-Penrose inverse and other generalized inverses. 
We will also treat scalars as 1 x 1 matrices, so that x'^ for x G M is 1/x when 
X y^O and is when x = 0. 

If matrices A and B are invertible, then {AB)~^ = B~^A~^ when the 
products are well defined. Such a reverse-order law does not necessarily 
hold when the inverse is replaced by a generalized inverse. 

There is a special case where a reverse order law holds. Ben-Israel and 
Greville (2003), Chapter 1.6, credit a private communication of C. C. Mac- 
Duffee for it. Tian (2004) describes further sufficient conditions for a reverse 
order law. 

Theorem 1 (MacDuffee's theorem). Suppose that X = LR, where L G 
M™^^ and G M^''" both have rank k. Then 

(2.1) X+ = R+L+ = R\RRY\L'Ly^L'. 

Proof. We find directly from the properties of a Moore-Penrose inverse 
that L+ = {L'L)-^L' and R+ = R'{RR')-^. Finally, the RHS of (2.1) satisfies 
the four defining properties of the Moore-Penrose inverse of LR. □ 

The SVD is discussed in detail by Golub and Van Loan (1996). Let 
X G M^-x" and set p = min(m,n). The SVD of X is, in it's "skinny" ver- 
sion, X = UJ:V', where U G M"*^?' and V G W^'p with U'U = V'V = Ip 
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and S G = diag(o-i, 0-2, . . . , cjp) with cti > (T2 > • • • > dp > 0. The SVD 
provides an exphcit formula = VTi~^U' = J2^=icrfviu[ for the Moore- 
Penrose inverse of X . 

In apphcations the SVD is very useful because it can be used to construct 
an optimal low rank approximation to X. For X G M'"^*^, we use to 

denote ^Jj27^i Xf- , its Frobenius norm. The approximation theorem 
of Eckart and Young (1936) shows that the rank k matrix closest to X in 
Frobenius norm is 

k 

xW^Y^amv-. 

i=l 

3. Cross-validating the SVD. The idea behind cross-validating the SVD 
is to hold out some elements of a matrix, fit an SVD to some or all of the 
rest of that matrix, and then predict the held out portion. 

Several strategies have been pursued in the crop science literature. There 
i typically indexes the genotype of a plant and j the environment in which 
it was grown, and then Yij is the yield of food or fiber from plant type i 
in environment j. The value Xij is commonly a residual of Yij after taking 
account of some covariates. A low rank approximation to X^j can be used 
to understand the interactions in yield. 

We will ignore the regression models for now and focus on choosing k for 
the SVD. What makes model selection hard for the SVD is that methods 
based on counting the number of parameters used in an SVD model do not 
give reliable F tests for testing whether a setting known to have at least 
rank k actually has rank + 1 or larger, dos S. Dias and Krzanowski (2003) 
describe some F tests for discerning between ranks A; = and 1 that reject 
66% of the time when the nominal rate should be 5%. Some other tests 
remedy that problem but become too conservative for k > 0. 

3.1. Leaving out a matrix element. Suppose that we wish to hold out the 
entry Xij and predict it by some Xij computed from the other elements. 
The best known method is due to Eastment and Krzanowski (1982). They 
fit an SVD ignoring row i and another one ignoring column j. They use 
the left singular vectors from the SVD that ignored column j and the right 
singular vectors from the SVD that ignored row i. Therefore, Xij is not used 
in either of those SVDs. From two SVDs they get two sets of singular values. 
They retain the first k < min(m — l,n — 1) of them and combine them via 
geometric means. Louwerse, Smilde and Kiers (1999) generalize this cross- 
validation method [as well as that of Wold (1978)] to choose the rank of 
models fit to m x n x /c arrays of data. 

Unfortunately, separate row and column deletion typically yields cross- 
validated squared errors that decrease monotonically with k [dos S. Dias 
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and Krzanowski (2003)]. Besse and Ferre (1993) present an explanation of 
this phenomenon using perturbation theory. As a result, some awkward ad- 
justments based on estimated degrees of freedom are used in practice. 

A second difficulty with this kind of cross-validation is that the sign 
for each term in the combined SVD is not well determined. Eastment and 
Krzanowski (1982) chose the sign of the kth outer product so that its upper 
left element has the same sign as that from a full SVD fit to the original 
data. As such, the method does not completely hold out Xu . 

The approach of Gabriel (2002) described in Section 1 does not require 
looking at Xu , and its cross- validated squared errors seem to give reasonable 
nonmonotonic answers in the crop science applications. Accordingly, this is 
the cross-validation method that we choose to generalize. 

3.2. Leaving out an r by s submatrix. Suppose that we leave out an r 
by s submatrix of the m by n matrix X. For notational convenience, we 
suppose that it is the upper left submatrix. Then we partition X as follows 

(3.1) X=(^ ^), 

where AeW', Be R'^x C G rC^-^xs ^nd D G ]^{m~r)x{n-s) _ 

Lemma 1 (Self consistency). Suppose that X G R'^xn fiQyijig rank k is 
partitioned as in equation (3.1) and that the matrix D g r('"~''')x("-«) ^p. 
pearing there also has rank k. Then 

(3.2) A = 5L>+C = 5(l)W)+C7. 

Proof. Because D has rank k, we find that D = D^''^ and so we only 
need to prove the first equality. If /c = 0, then A = BD^C = holds trivially, 
so we suppose that /c > 0. 

We write the SVD oiXasX = Y!1=i (TiUiv[ = C/SF', for S = diag((Ti, . . . , cj^) 
with 0-1 > <T2 > • • • > o-fc > 0, U £ R""^*^ and V £ W""". Let Ui G R^^^ con- 
tain the first r rows of U and U2 G R{"^-'^')xfe contain the last m — r rows. 
Similarly, let Vi contain the first s rows of V and V2 contain the last n — s 
rows of V . Then 

A = Ui^Vi, B = UiY^V2, C=U2^Vl and D = f/aSFa'- 

Let D = LR, where L = U2S and R = SV2, for 5 = diag(i/ai, . . . , y/ok). 
Then by MacDuffee's theorem, 

D+ = R+L+ = R'{RR'y^{L'L)-^L' = V2S{SViV2S)-^{SU!2U2S)-^ SU!^ 

= V2{yiV2)-^i^'\u'2U2r^u'2. 
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Substituting this formula for into BD^C and simplifying gives 
BD^C = (C/iSl/2')^2(V"2^2)"^S-i(C/^C/2)~^C/2(t^2SFi') 

= C/iSF/ = ^. □ 

Lemma 1 would have been quite a bit simpler had we been able to write 
D'^ as V2^~^U2. But need not take that simple form, because the de- 
composition D = U2^V2 is not in general an SVD. The matrices U2 and V2 
are not necessarily orthogonal. 

We use Lemma 1 to interpret A — B{D^^^)'^C as a matrix of residuals for 
the upper left corner of X with respect to a model in which X follows an 
SVD of rank k. To do {h x £)-fold BCV of the SVD, we may partition the 
rows of X into h subsets, partition the columns of X into I subsets, and 
write 

BCV(fc) = X:EPai)-i?(i,j)(5(^,i)('^)+C(i,i)||^, 

i=l j=l 

where A[i^j) represents the held out entries in bi-fold and similarly for 
B{i,j), C{i,j) and D{i,j). 

There are other ways to define residual quantities. When both X and D 
have rank k, then A — BD^C = 0. We could replace any or all of A, C 
and D by a rank k approximation and still get a matrix of Os for X and D 
of rank k. Assuming that we always choose to use there are still 

eight different choices for residuals. The simulations in Section 7 consider 
residuals of the following two types: 

(3.3) (I) ^-B(Z)W)+C, 

(3.4) (II) A-B^^\b^^^)^C^^\ 

In the next section we show how residuals of the first type above correspond 
to a cross-validation of principal components regression (PGR). 

3.3. Cross-validation of regression. In multivariate multiple regression 
we have a design matrix Z S and a response matrix Y G M'"^* to be 

predicted by a regression model of the form Y = Zf3, where (3 G M"^*. The 
case of ordinary multiple regression corresponds to s = 1. In either case the 
least squares coefficient estimates are given by /3 = {Z'Z)~^Z'Y, assuming 
that Z has full rank, and by /? = Z^Y more generally. 

If we leave out the first r rows of Z and Y and then predict the left out 
rows of y, we get a residual 

Yi.,r,i..s-Zr,r,i.J = A-BD+C, 
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in the decomposition 

(3.5) {Y Z)=(^^'--''''--' Zi.,r,l:n \ 

\^{r+l):m,l:s ^ [r+l) : m,l: n J 

From this example, we see that BCV changes regression cross-vahdation 
in two ways. It varies the set of columns that are held out, and it changes 
the rank of the linear model from the number of columns of Z to some other 
value k. The regression setting is much simpler statistically, because then 
the matrix D to be inverted is not random. 

The generalized Gabriel bi-cross-validation we present can now be de- 
scribed in terms of ordinary cross-validation of PGR. Suppose that we hold 
out the first r cases and fit a regression of ^(r+i) : m,i : s on the first k principal 
components of .^(r+i) : m,i -.n- The regression coefficients we get take the form 
0^^))+C. Then the predictions for the held out entries are A = B{D^''^)'^C. 
Thus, conditionally on the set of columns held out, BGV is doing a GV of 
PGR. 



3.4. Separate row and column deletion. The method in Eastment and 
Krzanowski (1982) does not generally have the self-consistency property. 
Gonsider the rank 1 matrix 

X = auv = a ' ' ' 



where a > 0, uq eR, ui e ]R'"~\ vq^R and vi G M"-^ 
1. Leaving out column 1 of X and taking the SVD yields (T(ibti)(ibt>i)' = 
((T||ui||)(ibu)(ib?;i/||ui||)'. The signs in these two factors must match but can 
be either positive or negative. Similarly, leaving out row 1 of X and taking 
the SVD yields (fT||ni||)(±ni/||'Ui||)(±^;)'. 
Gombining these two parts yields 

_Lll ll-l/2|| 11-1/2 / I / 1 

± ui ' \\vi\\ ' auv = ±auv - 



^{l-ul){l-vl) 



The sign is arbitrary because the choices from the two SVDs being combined 
might not match. Even with the correct sign, the estimate is slightly too 
small in magnitude. 

4. Exceptions. Lemma 1 has a clause that requires the submatrix D to 
have rank k just like the full matrix X. This clause is necessary. For example, 
the "spike matrix" 



(4.1) X 



/I 
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0^ 
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clearly has rank 1, but any submatrix that excludes the upper left entry has 
rank 0. Thus, A ^ BD^C for this matrix whenever A contains the upper 
left entry. Clearly, BD^C equals zero for this matrix. The squared error 
from fitting the true rank fc = 1 is the same as that from fitting A; = 0. 

One could not reasonably expect to predict the 1 in the upper left corner 
of the spike matrix in (4.1) from its other entries that are all Os. In this 
instance, the exception seems to provide a desirable robustness property. 

Another exception arises for a "stripe matrix" such as 



(4.2) 



X 





1 


1 


1^ 






































Vo 








0^ 







1 1 1 1), 



In this case, predicting a 1 for the upper left entry from the rest of the data 
seems slightly more reasonable, because that is what a continuation of the 
top row would give. Of course, a continuation of the left most column would 
give a prediction of 0. For this case it is not so clear whether the upper left 
value should be predicted to be a or a 1 from the other entries. In crop 
science applications, the row might correspond to a specific genotype or 
phenotpye and then it would have been caught by an additive or regression 
part of the model. Put another way, the residual matrices that arise in crop 
science do not generally contain additive main effects of the type shown 
above. However, in other applications, this sort of pattern is plausible. 
Yet another exception arises for an "arrow matrix" such as 



(4.3) 



X 





1 


1 


1\ 













1 











Vi 








0/ 




1 



1 



1 



This matrix has rank 2, but the upper left entry will not be correctly pre- 
dicted by the formula BD^C . In this case a value of 1 seems like a plausible 
prediction for the upper left element based on all the others. It fits a multi- 
plicative model for elements of X. But it is not the only plausible prediction 
because an additive model would predict a 2 for the upper left entry. 

It is clear that if the formula BD^C is to match A, that the singular 
value decomposition of X when restricted to the lower right m — r hy n — s 
submatrix must not become degenerate. The columns of U2 S x from 

the proof of Lemma 1 must not be linearly dependent, nor can those of 
V2 G R^"'"*)^'"" be linearly dependent. 

If we hold out r rows and s columns, then some outer product feature, 
which affects fewer than r + 1 rows or s -|- 1 columns, can similarly be missed. 
Therefore, features of the low rank approximation to X that are not suffi- 
ciently broadly based are not faithfully represented in bi-cross- validation. In 
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the case of spikes and stripes, this seems to be a desirable noise suppression 
property. If the data set features a large number of very small tight clus- 
ters, such as near duplicates of some rows, then bi-cross-validation might 
underestimate k. 

This property of bi-cross-validation is not a surprise. A similar phenomenon 
happens in ordinary cross-validation. If leaving out a subset of the predictors 
causes the design matrix to become singular, then cross- validating a correct 
model with no noise will not give a zero residual. 

If one is willing to forgo the robustness of ignoring spikes and is concerned 
about missing large but very sparse components in the SVD, then there 
is a remedy. Let Ol and Or be uniform random orthogonal matrices of 
dimensions mxm and nx n respectively. Then X = OlXO'j^ = OlUTjV'O'^ 
has the same singular values as X but has no tendency to concentrate the 
singular vectors into a small number of rows or columns. The BCV of X 
is equally sensitive to outer products auv' with sparse u and v as with 
nonsparse u and v. 

5. Cross-validating the NMF and other outer product decompositions. 

In the nonnegative matrix factorization of Juvela, Lehtinen and Paatero 
(1994) and Lee and Seung (1999), X is approximated by a product WH^ 
where W G [0,cx))™'^*^ and H E [0, oo)'^^'^. That is, the matrices W and H 
are constrained to have nonnegative entries. Ordinarily, both factors of X 
have rank k. We will suppose that W and H are chosen to minimize ||^ — 
I^iJ|||., although other objective functions are also used and can be bi-cross- 
validated. The estimated NMF often has many zero entries in it. Then the 
rows of H can be interpreted as sparsely supported prototypes for those of 
X, yielding a decomposition of these rows into distinct parts [Lee and Seung 
(1999)]. 

There are several algorithms for computing the NMF. We use alternating 
constrained least squares. Given VF, we choose H to minimize ||X — Vl^ff||p, 
over [0,00)*^^", and given we choose W to minimize ||X — over 
[0,oo)™'^''\ Unlike the SVD, there is no certainty of attaining the desired 
global optimum. 

5.1. Residuals for the NMF. To bi-cross-validate the NMF, we proceed 
as for the SVD. Let X be decomposed into parts A, B, C and D as before, 
and suppose that X = WH, where W e [0,oo)"^^'' and H G [0,oo)''^" both 
have rank k. Then if the retained submatrix D also has rank k, we once 
again find A = BD^C by applying the self-consistency Lemma 1, followed 
by MacDuffee's theorem. Indeed, 

(5.1) A = BiD^'^Yc = B{WdHd)^C = {BH+)(W+C), 
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where D = WoHu is the NMF for X restricted to the submatrix D. This 
process leads to the residual matrix 

(5.2) A-W^''H^^\ 

where W^^^ = B0^'^)+ and = {wP)+C, and D = h'^;'^ is an 
NMF fit to D. We refer to (5.2) as the "simple" residual below. It vanishes 
when a fc-term NMF holds for both X and D. 

In general, the left and right factors W\ and in (5.2) need not 

have nonnegative elements, especially when the rank of X is not equal to 
k. It seems preferable, at least aesthetically, to require both factors to have 
nonnegative entries. There is a natural way to impose constraints on these 
factors. The product {W^^)^C has a least squares derivation which we can 
change to constrained least squares, replacing 

= argmin||C7 - w}^^ H\\l by 
h'x^ = argmin \\C-Wi^^H\\l. 

//G[0,oo)'=x'' 

We similarly replace W^^'^ by 

W^'^= argmin ||B - VF^g"^ ||^. 

iye[o,oo)'-xfe 

The residual we use is then 

(5.3) A-W^^'^H^^l 

We call (5.3) the "conforming" residual. Unlike simple residuals, the con- 
forming residuals can depend on which factorization of D is used. If X has 
a k term NMF and the submatrix D has an NMF unique up to permutation 
and scaling, then the conforming residual is zero. When D has a A:-term 
but inherently nonunique NMF, then it may be possible to get a nonzero 
conforming residual. Laurberg et al. (2008) give necessary and sufficient 
conditions for the NMF to be unique up to permutation and scaling. 

Figure 1 illustrates the BCV process for the NMF. Ordinarily there are 
h distinct row holdout sets and i distinct column sets and then there are 
M = hi holdout operations to evaluate, but the algorithm is more general. 
For example, one could choose M random r x s submatrices of X to hold 
out. 

Requiring nonnegative factors in the estimate of A is not analogous to 
replacing the type I residuals (3.3) by type II residuals (3.4). The analogous 
modification would be to replace the submatrices B and C by their rank k 
NMF fits. We do not investigate this option. 

The examples from Section 4 have implications for bi-cross-validation of 
the NMF. lfX = WH = J^Li WiK and one of the vectors Wi or hi is almost 
entirely zeros, then that factor will be harder to resolve. 
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Given: 

A matrix X € [0, cxs)'" '^", row and column holdout subsets Te C {1, . . . ,m}, 
Se C {1 , n} for ^ = M, and a list of ranks >C C {1, . . . , min(m, n)}. 

Do: 

1) For kefC: BCV(fc)^0 

2) For £€ {!,..., L} and fc £ AT: 

3) I ^le and J ^ Je 

4) Fit NMF: X-i,-jr = wlj'^__y/fi*;^ 

5) ^ argmin„.e[o,«,),->„!-„ j - WHI%_^ \\% 

6) /f^'^!, ^ argmin„e[o,oo)™-'-x- tl^-i.J " "'Iz -j^IIf 

8) BCV(A') ^ BCV(A:) + l\Xx,j - X^^''lr\\%. 
Return: 

BCV{fc) for fc € JC. 

Fig. 1. This algorithm describes bi-cross-validation of the nonnegative matrix factoriza- 
tion. It uses a squared error criterion and conforming residuals (5.3). To use the simple 
residuals (5.2), replace lines 5 through 7 by X^'lj ^ Xi^-j{H^^'^ _j)^{w!i^ _j)~^X-x^j. 



5.2. Other outer product models. Other models of outer product type 
are commonly used on matrix data. Lee and Seung (1999) point out that 
the model underlying A;-means clustering of rows has the outer product form 
X = LR, where L G {0, 1}™^*^ with each row summing to 1 while R G M'^^'". 
It may thus be bi-cross-validated in a manner analogous to the NMF, re- 
specting the binary constraints for L. It may seem odd to leave out some 
variables in a cluster analysis, but Hartigan (1975) has advocated leaving 
out variables in order to study stability of clustering, so bi-cross-validation 
of fc-means may be useful. Clusters that involve fewer rows than the number 
being held out could be missed. 

The semi-discrete decomposition of Kolda and O'Leary (1998) approx- 
imates X by a matrix of the form UT,V' , where U G {—1,0,1}™^*^, V € 
{—1,0,1}"^'^ and S is a nonnegative k hy k diagonal matrix. MacDuffee's 
theorem applies to this decomposition because we can absorb S into one 
of the other factors. Thus, a residual like the one based on (5.2) can be 
constructed for the SDD. 

6. Random matrix distribution theory. We suppose here that X is a 
rank or 1 matrix plus i.i.d. Gaussian noise. While we expect BCV to 
be useful more generally, this special case has a richly developed (and still 
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growing) theory that we can draw on to investigate BCV theoretically. The 
Gaussian setting allows special methods to be used. Some of the random 
matrix theory results have been extended to more general distributions, 
usually requiring finite fourth moments. See Baik and Silverstein (2004) and 
Soshnikov (2001) for some generalizations. 

Section 6.1 considers the case where the true k = and we fit a model of 
either rank or 1. Results on sample covariances of large random matrices 
are enough to give insight into the hold-out squared error. The case where 
the data are generated as rank 1 plus noise is more complicated, because it 
cannot be handled through the sample covariance matrix of i.i.d. random 
vectors. Section 6.2 reviews some recent results of Onatski (2007) which 
apply to this case. Then Section 6.3 applies these results to our partitioned 
matrix. Section 6.4 organizes the findings in a 2 x 2 table with true and 
fitted ranks both in {0, 1}. 

For simplicity, we look at the differences between E(BCV(fc)) for the cor- 
rect and incorrect k. When these are well separated, then we expect the 
method to more easily select the correct rank. For this analysis, we neglect 
the possibility that in some settings using the correct rank may give a greater 
squared error. 

6.1. Pure noise. Throughout this section Xij are independent A^(0, 1) 
random variables. There is no loss of generality in taking unit variance. We 
partition X into A, B, C and D as before, leaving out the r x s submatrix 
A and fitting an SVD of rank k to D. The true A; is and we will compare 
fits with A; = to fits with k = 1. 

For this pure noise setting we will assume that m>n. This can be ar- 
ranged by transposing X and simultaneously interchanging r and s. 

For k = 0, we trivially find that D^^^ =0 and so the error from a rank fit 
is ?(°) = A-B{D^^'^)+C = A. For k = l, the residual is e^^) = A- B{D^^'^)'^C . 
Because A is independent of B, C and D, we easily find that 

E{\\e('^\l)=E{\\A\\l)+E{\\B{m)-'C\\l) > E{\\A\\l) =E{\\e<'^\\l). 

The true model has an advantage. Its expected cross- validated error is no 
larger than that of the model with k = l. 

The rank 1 approximation of X is comparatively simple in this case. We 
may write it as 

1(1) 

= aiuv' . Then, from Muirhead (1982) we know that 
u, V and ai are independent with u uniformly distributed on the sphere 
= {x £ I x'x = 1}, and v uniformly distributed on S"""^. 
Let o"! be the largest singular value of X. Then af is the largest eigenvalue 
of X'X . Such extreme eigenvalues have been the subject of much recent 
work. The largest has approximately a (scaled) Tracy-Widom distribution 
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am,n = Wm - 1 + \/n) 



Wi. Specifically, let 

l^m,n = Wm - 1 + ^/nf and 

\JrrL - 1 \/n) 
Then from Theorem 1 of Johnstone (2001), 

2 

as m and n go to infinity in such a way that m/n ^ €> 1. For our purposes 
it will be accurate enough to replace the (m — l)'s above by m. While /im,n 
grows proportionally to m, am,n grows only like m}/^, so al is relatively 
close to its mean. 

Putting this together, B{D^^'^)+C = {Bu)a^^{v'C), where Bu G and 
v'C G M^^* both have i.i.d. J\f{0, 1) entries, independent of cJi, and so 

E(PpW)+C||^) = rsE(0. 

The random variable af is stochastically larger than a X(m) random 
variable and so E((Tf^) does exist. Also, am,n/fJ'm,n becomes negligible as 
m,n^ oo. Therefore, E(o"j~^) w n^-r,n-s = (V^ ~ + ~ -s)"^! recalling 
that the SVD is applied to D which has dimensions (m — r) x (n — s). 

When we partition the rows of X into m/r subsets and the columns into 
n/s subsets and then average the expected errors, we get 

— E(BCV(0)) = 1 and 

(6.1) 

— E(Bcv(i)) w 1 + (V^^r^+ ^/^r^)"l 

mn 

The bi-cross- validated squared error is, on average, slightly larger for k = 1 
than for k = 0. As equation (6.1) shows, the expected difference between 
BCV(l) and BCV(O) grows if r/m and s/n are taken larger. Thus, we expect 
larger holdouts to be better for avoiding overfitting. 

6.2. Rank 1 plus noise. Now we suppose that 
(6.2) X = Kuv' + Z 

for unit vectors u G and v G M", and a constant k > 0. Here kuv' is a 
rank 1 signal obscured by the noise, Z. The elements Zij of Z are i.i.d. 
Ar(0, 1) random variables. If we fit a rank model, then the expected mean 
squared error is E(BCV(0)) = n^ + mn. The root mean square of the elements 
in the signal is K{rnn)~^/'^ and so it is reasonable to consider n increasing 
with m and n. 



BI-CROSS- VALIDATION OF SVD AND NMF 



15 



When exactly one of u and v is fixed and the other is random, then model 
(6.2) is a special case of the spiked covariance model of Johnstone (2001). 
The spiked covariance model has either independent rows and correlated 
columns or vice versa. If columns are independent, then XX' is a spiked 
covariance, and theoretical results on its first eigenvalue apply directly to 
the left singular vector of X . Conversely, independent rows let us approach 
the left singular vector of X via X'X. But we need both left and right 
singular vectors of X and we cannot assume that both rows and columns 
are independent. 

Some recent results due to Onatski (2007) allow us to investigate model 

(6.2) where both u and v are fixed. Either or both could be random as well. 
We specialize the results of Onatski (2007) to the case of k = l determinis- 
tic factor with noise level a = 1 and deterministic loadings. His deterministic 
factor F is our vector v^/n. His loadings L then correspond to our Ku/y/n. 
Onatski (2007) studies a "weak factor model" whose small loadings are such 
that L'L = ii?/n approaches a constant di. The weak factor model has k 
growing slowly with n. Earlier work of Bai (2003) used "strong factors" 
with much larger loadings, in which L'L/n approaches a constant. 

Theorem 2. Let X = kuv' + Z, where u G and u € are unit 
vectors, and Zij are independent M{0, 1) random variables. Suppose that 
m and n tend to infinity together with m — cn = o(m~^/^) for some c G 
(0, oo). Let ii be the largest eigenvalue of XX' /n. If di = k'^ /n > ^/c, then 
y/n{ii — mi) — >AA(0, S^) in distribution, where 

(di + l)(di+c) 2idj-c) ^^^ 

mi = ; and = ^ (2di + c+l). 

di df 

Proof. This is from Theorem 5 of Onatski (2007). □ 

When di < \fc there is a negative result that simply says l\ tends to 
(1 + \/c)^ regardless of the actual value oi d\ = k?' jn. 

The critical threshold is met when d\ > i/c. That is, K^/n > ^/c = y^m/n, 
and so the threshold is met when 

(6.3) = 6\/mn for 5 > 1. 

The critical value for k is unchanged if we transpose X, thereby switching 
m and n. We will consider large but finite 5. 

The top singular value of X is ai and af = nli , which grows as m and n 
grow. We scale it by which grows at the same rate, and find 

IE(<7f) ^ nE(li) ^ nmi ^ n(di + l)(di + c) ^ / ^ ^\ / ^ y/c\ 

K? dlK^ V ^\/c/ \ ^ /' 



16 



A. B. OWEN AND P. O. PERRY 



after some simplification. 

The first singular value ai then scales as a multiple of k. That multiple 
does not tend to one in the weak factor limit. But if 6 is large, the bias is 
small. 

The variance of aH k? = {^/n/ k?) x y/nli is 



2 ^ ^ 1 

'26 + .fc + 



after some simplification. Treating d\ as an estimate of k^, we find that 
the ratio has a standard deviation that is asymptotically negligible 

compared to its bias. 

Onatski (2007) also considers the accuracy with which the vectors u and 
V in our notation are estimated. 

Theorem 3. Under the conditions of Theorem 2, let u be the first eigen- 
vector of XX' /n and v be the first eigenvector of X'X/m. If di > ^/c, then 

E^^/^(ii'u-/i(7)^AA(0,l) and - /iy) ^ 7^(0, 1), 

where 



dj-c , dl 

h''U = \ -tTj TT ^'^w fJ-V ~ 



di{di + l) \di{di+c) 
and as di becomes large, 

max(S[/, T,v) = 0((if ^). 

Proof. These results come from Theorems 1 and 2 of Onatski (2007). 
The values for jiu and fiy are found by direct substitution. From part c of 
his Theorem 2 with {i,j) = {t,s) = (1, 1) and 0iiii = 0, we find 

Cdiidi + 1)2 ^ fdi + {{d, + 1)2 - (1 - c))2c2 



2{di+c){dl-c)^\ \di+cj J 2di{dl-c){di+c)^ ' 

By inspecting the powers of di inside this expression for Sy, we may ver- 
ify that Sy = 0((ij~2) as di co. From Onatski's Theorem 1, part c, we 
similarly find 

^ {c^ + di){di + l) rfl(c-l) [(di + l)2-(l-c)]2di 

^ 2di{dj-cy 2{dl - c){di + 1) 2(d2-c)((ii + l)3 • 

The expression for Sy can be rewritten as a ratio of two polynomials. At first 
they both appear to be of eighth degree. But upon inspection, the coefficients 
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of di and ({[ vanish from the numerator and the result is asymptotic to 
5dj^^/2 as di grows. □ 

Onatski's quantity di equals K?/n = It follows that both [lu and \xy 
become close to 1 for large 5, as m and n increase. Similarly, and Ey 
have small limits when b is large. 

6.3. Partitioning X . We write X E in the form 

where A G W"^'' and the other parts of X of conforming sizes. The entries 
of Z are independent Gaussian random variables with mean and variance 
1. We will compare rank and 1 predictions of A, denoted by A'^'^'^ and A^^'^ 
respectively. 

These estimates take the form Bg{D)C, where g{D) G may 
be thought of as a regularized generalized inverse. For 1(0) 

we take g{D) = 0, 

while for l(i) we take g{D) = (Z)(i))+ = K+ua*^. 

The expected mean square error from the rank model is 

E{\\Afp I u,v,Z22)=E{\\Afp I u,v)=rs + K^\ui\'^\vi\'^. 

Summing this MSE over m/r hold out row sets and n/s hold out columns 
sets, we get E(BCV(0)) = mn + . For weak factors, the expected cross- 
validated error of the null model is mn + b\Jmn. 

For the rank 1 model we build some machinery. Lemma 2 below integrates 
out the contributions from Zn, Z\2 and Z^x for u, v and Z22 fixed by sam- 
pling or by conditioning. Lemma 2 also applies to functions g corresponding 
to SVDs of rank higher than 1 and to the nonnegative matrix factorization, 
among other methods. 

Lemma 2. Let X he as partitioned above with matrix Z having inde- 
pendent entries with E,{Zij) = and Y{Zij) = 1. Let g{-) be a function from 

^(m-r)x(n-s) j^(n-s)x{m-r) let G = g{D) . Then 

¥.{\\A- BGC\\l\ Z22,u,v) 

= rs + K^\\uiv'i — KUlv'2Gu2v'i\\\ + SK^||uiW2G|||' 

+ rK'^\\Gu2v[fp + rs\\Gfp 
<rs + — K?;2Gu2)^|uiP|fiP 

I Il/^ll2 / 2 I |2| |2 I 2 I |2| |2 I \ 

+ ||G||^(k s|mi| 1^21 +K r\u2\ \vi\ +rs). 
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Proof. Decompose 

A — BGC = KUiv\ + Zii — {nuiv'2 + Zi2)G{ku2v[ + Z21) 

= Zii - KU1V2GZ21 - kZi2Gu2v[ - Z12GZ21 
+ KUiv'i — K^Ulv'2Gu2v'i 

and sum the squares of the terms to get the equahty. Then apply bounds 
\Xy\ ^ II-^^IIfIi/I for matrix X and vector y. □ 

Next we look aX D = KU2V2 + Z22 and apply the results from Onatski 
(2007) in Section 6.2. Let the first term in the SVD of D be ku2V2- Then for a 
rank 1 fit g{D) = k~^V2U2, and so ||G'|||;' = and KV2GU2 = {K/k)v2V2U2U2- 
From here we proceed informally in averaging out .^22- 
From Theorem 2 we have > with overwhelming probability be- 
cause the bias dominates the variance. Therefore, we will suppose that 
E{\\G\\Ik'^) = < 1. Then 

E{\\A - BGCfp) <rs + k2E22((1 - k7;^G'U2)^)|ui|>i|2 

+ s|uip|t;2p + ?^|u2|^|t'iP + rsK'"^ , 

where E22 denotes expectation over Z22 with u and v fixed. If we sum over 
m/r disjoint row blocks, and n/s disjoint columns blocks, then 

E(BCV(1)) < mn + Impl^il^sCll - A^t^^Gns)^) + £!! + ™ + !!^^ 

T S Ki 

Ml Vl 

where the summation is over all {mn)/{rs) ways of picking the rows and 
columns to hold out as A. Because u and v are unit vectors, we always have 

\U2\, \V2\ < 1- 

Next we turn to (1 — kv2Gu2)'^ ■ Define unit vectors U2 = ^^2/ 1^^215 and 
V2 = 1^2/1^^211 and let k = K|n2||f2|- Then D = KU2V2 + ^22 = ku2V2 + .^22- 
The matrix D now plays the role of X in Theorems 2 and 3. Let D have 
the largest singular value k with corresponding singular vectors U2 and V2, 
so that G = k~^V2U2- Then 

(6.4) KVnGu2 = —V2V2U2U2 = 1 — n — r~'^2'f^2^^2^2- 

Equation (6.4) shows where problems arise if we have held out most of the 
rows dominating u. Then is large and \u2\ = \/l — \ui\^ is small, and of 
course a similar problem arises when vi has most of the structure from v. 

From Theorems 2 and 3 we know that ik / k)v2V2u'2U2 will be quite close 
to 1, for large 5. We do not make a delicate analysis of this because we want 
to consider settings where \u2\ and \v2\ need not be close to 1. We suppose 
only that |n2p > ?? and \v2\^ > rj for some rj > 1/2. For example, if 7/ = 3/4, 
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then we never hold out more than 1/4 of the squared signal in n or u . This 
is a very conservative condition when r and s are small such as r = s = 1. 
Now our conservative estimate for E22((l — kv2Gu2)'^) is simply (1 — 
Under this very conservative assumption, we get 



r s 



E(P - BGCWl) < mn + - l/r?)^ + Z!l + ^ + 11^ 

(6.5) 

= mn + 6\ 



6 

To see why the assumption is conservative, consider a model in which the 
components of u and v were generated by sampling AA(0, 1) entries which 
are then normalized and fixed by conditioning. Suppose further that r and 
s are small. Then is about \/l — r/m. Similarly, \v2\ is about \/\ — s/n 
and in combination with large 5, we find the coefficient E22((l — nv'2Gu2)^) 
of byjmn in (6.5) nearly vanishes. 

6.4. Summary. We summarize the results of this section in a 2 x 2 layout 
given in Table 1. For simplicity, we consider proportional holdouts with 
r/m = s/n = 6. There is an unavoidable contribution of mn to the squared 
errors coming from the noise Zn in the held out matrix A. If the true k = 0, 
and we fit with k = then that is all the error there is. Accordingly, we place 
a in the upper left entry of the table. 

If the true A; = and we fit with k = 1, we get a squared extra error of 
the form 

mn 1 Jmn 



[yf^^^^^P^sf 1-9 

This provides the lower left entry in Table 1. Larger holdouts, as measured 
by 9, increase the separation between rank and rank 1, when the true rank 
is 0. Similarly, extreme aspect ratios, where c is far from 1, decrease this 
separation, so we expect those aspect ratios will lead to more errors in the 
estimated rank. 

Table 1 

This table summarizes expected cross-validated squared errors from the text. The lower 
right entry is conservative as described in the text. The value rj £ (1/2,1) represents a 
lower bound on the proportion not held out, for each singular vector u and v. The value 9 
is an assumed common value for r/m, and s/n and 5 > 1 is a measure of signal strength 



E(BCV(fc)) - mn 


True fe = 


True fe = 1 


Fitted fc = 
Fitted fc = 1 




1 \Jm.n 

^+1/^+2 


i^frrvn 

^/^{&{\ - \/r)f + c=*/2 + c-^*/" + 1/5) 
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Now suppose that the true model has rank k = 1 with a value of k sat- 
isfying = 5y/rrm for (5 > 1. If we fit a rank model, we get an expected 
summed squared error of mn + b^fmn. This provides the upper right entry 
of Table 1. 

Finally consider the case where we fit a rank 1 model to a rank 1 dataset. 
For simplicity, suppose that we never hold out most of u or v. Then the 
retained portion of u has ^2^2 > 77 > 1/2 and, similarly, v'2V2 > ?? > 1/2. For 
r/m = s/n = 6, equation (6.5) simplifies to 

mn + 6^/mn(l - l/rif + y/mnie'/'^ + c~^/^) + 



which gives us the lower right entry in Table 1. 

We want the lower right entry of Table 1 to be small to raise the chance 
that a rank 1 model will be preferred to rank 0. Once again, large aspect 
ratios, c» 1, are disadvantageous. The holdout fraction 9 does not appear 
explicitly in the table. But a large holdout will tend to require a smaller 
value for 77 which will then raise our bound BCV(l). 

In summary, large holdouts make it easier to avoid overfitting but make 
it harder to get the rank high enough, and extreme aspect ratios make both 
problems more difficult. 

7. Examples for the truncated SVD. In this section we consider some 
simulated examples where X = ^ + Z for which Z ~ AA(0, Im ® In) is a matrix 
of i.i.d. Gaussian noise and /i € M™^" is a known matrix that we think of as 
the signal. The best value of k is A;°p* = argmin^ — and we can 

determine this in every sampled realization. For a data determined rank k, 

we can compare H^'-'^^ll to 

We generate the matrix /i to have pre-specified singular values ti>T2> 
■ ■ ■ > T^in{m,n) > as follows. We sample Y ~ AA(0,/m ® In), fit an SVD 
Y = UT,V' and then take n = UTV' , where T = diag(r). 

We look at two patterns for the singular values. In the binary pattern, r oc 
(1, 1, . . . , 1, 0, 0, . . . , 0) including k nonzero singular values. In the geometric 
pattern, r oc (1, 1/2, 1/4, . . . ^2~™™(™'")). Here fx has full rank, but we expect 
that smaller k will lead to better recovery of We make sure that the 
smallest singular values of fj, are quite small so that our identification of fi 
as the signal is well behaved. 

7.1. Small simulated example. We first took m = 50 and n = 40. The 
singular values were scaled so that = E(||Z|p), making the signal equal 
in magnitude to the expected noise. For this small example, it is feasible to do 
a 1 by 1 holdout. Gabriel's method requires 2000 SVDs and the Eastment- 
Krzanowski method requires 4001 SVDs. The results from 10 simulations 
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Fig. 2. This figure shows the results of Gabriel and Eastment-Krzanowski 1x1 hold out 
cross-validation on some 50 by 40 matrix examples described in the text. In the left panel 
the signal matrix has 10 positive and equal singular values and 30 zero singular values. The 
dotted red curves show the true mean square error per matrix element HX'*^' — /i|p/(mn) 
from 10 realizations. The solid red curve is their average. Similarly, the black curves show 
the naive error — X||^/(mn). The green curves show the results from Eastmen- 

t-Krzanowski style cross-validation. The blue curves show Gabriel style cross-validation. 
The right panel shows a similar simulation for singular values that decay geometrically. In 
both cases the mean square signal was equal to the expected mean square noise. 



are shown in Figure 2. For the binary pattern, an oracle would have always 
picked the true rank k = 10. For the geometric pattern, the best rank was 
always 3 or 4. The Eastment-Krzanowski method consistently picked a rank 
near 20, the largest rank investigated. The Gabriel method tends to pick a 
rank slightly larger than the oracle would, but not as large as 20. 

We also investigated our generalizations of Gabriel's method from Sec- 
tion 3.2 for holdouts of shape 2 x 2, 5 x 5, 10 x 10 and 25 x 20. These have 
residuals given by equations (3.3) and (3.4). For an r x s holdout, the rows 
were randomly grouped into r subsets of m/r rows each and the columns 
were grouped into s subsets of n/s columns each. 

For comparison, we also generalized the Eastment-Krzanowski method to 
r by s holdouts. As for 1x1 holdouts, one takes the right singular vectors 
of an SVD on X less r of its rows, the left singular vectors of an SVD on X 
less s of its columns and assembles them with the geometric means of the 
two sets of singular vectors. The only complication is in choosing whether 
to reverse any of the k signs for the singular vector pairs. Let = Wfcf^ 
be the A:th term of the SVD fit to X and let Mk = Ukv'k be the estimate 
formed from the two SVDs where part of X was held out. Let M^. M^, be 
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the componentwise product of these matrices. If the mean of over 
the rs held out elements was negative, then we replaced UkVk by —UkVk- 

A 2 X 2 holdout requires roughly one fourth the work of a 1 x 1 holdout. 
For comparison we also looked at replicating the 2x2 holdout four times. 
Each replicate had a different random grouping of rows and columns. We 
also replicated the 5x5, 10 x 10 and 25 x 20 methods 25 times each. 

A final comparison was based on three methods from Bai and Ng (2002) 
that resemble the Bayesian Information Criterion (BIC) of Schwarz (1978). 
In these, the estimate k is the minimizer of 

(7.1) BICi(^^)=log(||xW-X 

(7.2) BIC2(A:)=log(||x(^)-X 

(7.3) BIC3(A:)=log(||xW-X 

over k, where c = c{m,n) = mm{^/m, \Ai). 

The methods of Bai and Ng (2002) are designed for a setting where there 
is a true rank k < min(m,n) to be estimated. The binary pattern conforms 
to these expectations but the geometric one does not. In applications, one 
can seldom be sure whether the underlying pattern has a finite rank, so it 
is worth investigating (7.1) through (7.3) for both settings. 

Figure 3 summarizes all of the methods run for this example. BIC methods 

1 and 3 always chose rank k = 20, the largest one considered. BIC method 

2 always chose rank 1 for the binary setting, but did much better on the 
geometric setting, for which it is not designed. This example is probably too 
small for the asymptotics underlying the BIC examples to have taken hold. 

The original Eastment-Krzanowski method picked ranks near 20 almost 
all the time. The generalizations picked ranks near 1 for the binary case but 
did better on the geometric case. 

The lower left region of Figure 3 holds the methods with good relative 
performance in both cases. The methods there are all generalized Gabriel 
holdouts for r x r holdouts with r G {1,2,5, 10}. These methods have nearly 
equivalent performance. Both types I and II residuals are there but for no 
holdout size did type II beat type I, while there were a few cases where the 
sample mean square was smaller for type I. The methods using replication 
got slightly better performance on the binary case but somewhat worse 
performance on the geometric case. There is a slight tendency for larger 
holdouts to do better on the geometric case, and for smaller ones to do 
better on the binary case. The pattern we see is reasonable when we notice 
that the holdout sizes that did poorly for the binary case were not small 
compared to the number of nonzero singular values. 
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Fig. 3. This figure shows the mean square error, per element, for all the methods ap- 
plied to the 50 X 40 example. The case with geometrically decaying singular values is on 
the horizontal axis, and the binary case with ten equal nonzero singular values is on the 
vertical axis. Gabriel's method and our generalizations are shown in blue, (generalized) 
Eastment-Krzanowski is in green, the oracle is red, and Bai and Ng 's BIG estimators are 
in black. There are horizontal and vertical reference lines for methods that always pick 
k = l or k — 20. The cluster of blue points in the lower left corner is discussed in the text. 



The comparisons between I and II at fixed holdout size may not be sta- 
tistically significant. While one could possibly find significance by pooling 
carefully over holdout sizes, the more important consideration is that type 
II BCV is more awkward to implement and appears to perform worse. So 
these results favor type I, as do further results in the next section. The choice 
between Gabriel and EK style cross-validation is important. Similarly, the 
choice of holdout size is important because the largest holdout size did not 
end up among the most competitive methods. 

7.2. Larger simulated example. We repeated the simulation described 
above for X G ]^ioooxiooo_ ^^^^ ^^^^ repeat the Eastment-Krzanowski 
methods because the sign selection process is awkward and the methods are 
not competitive. We did repeat the Bai and Ng methods because they are 
easy to implement, and are computationally attractive in high dimensions. 
We considered both residuals I and II. 
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For the binary example we used a matrix fj, of rank 50. The geometric 
example was as before. We found that taking = E(||Zp) makes for a 
very strong signal when there are 10^ matrix elements. So we also considered 
a low signal setting with = 0.01E(||Z|p) and a medium signal setting 
with ll^p = 0.1E(||Zp). The quantity 5 takes values 20, 2 and 0.2 in the 
high, medium and low signal settings. The low signal setting has 6 <1 and 
so we expect that not even the largest singular vector should be properly 
identified. 

For the generahzed Gabriel BCV we took holdouts of 200 x 200, 500 x 500 
and 800 x 800. We used a single random grouping for each holdout. 

We looked at approximations of rank k from to 100 inclusive. This re- 
quires fitting a truncated SVD of an m — r by n — s matrix to k = 100 terms. 
When k <C min(m — r,n — s) the truncated SVD is much faster than the full 
one. We noticed a large difference using svds in Matlab, but no speed-up 
using svd in R. In some simulated examples comparable to the present set- 
ting, an SVD of X G to k terms takes roughly 0{mnk) computation, 
at least for large m<n. The exact cost depends on how well separated the 
singular values are, and is not available in closed form. A sharper empirical 
estimate, not needed for our application, could use different powers for m 
and n. 

For large k the approximation (D^^^)'^ can be computed by updating 
{D^)~^ with an outer product from the SVD of D. For type II residuals 
similar updates can be used for the pieces B and C oi X. Efficient updating 
of type II residuals is more cumbersome than for type I. 

The outcomes for type I residuals in the two high signal cases are shown 
in Figure 4. The BCV errors for 800 x 800 holdouts were much larger than 
those for 500 x 500 holdouts, which in turn are larger than those for 200 x 200 
holdouts. The selected rank k only depends on relative comparisons for a 
given holdout, so we have plotted BCV(/c; r, s)/BCV(0; r, s) versus A;, in 
order to compare the minima. A striking feature of the plot is that all 10 
random realizations were virtually identical. The process being simulated is 
very stable for such large matrices. 

Comparing the BCV curves in Figure 4, the one for large holdouts is the 
steepest one at the right end of each panel. Large holdouts do better here at 
detecting unnecessarily high rank. For the binary case, the steepest curves 
just left of optimal rank 50 come from small holdouts. They are best at 
avoiding too small a rank choice. 

The type II BCV did not lead to very good results. The only time it was 
better than type I was for 200 x 200 holdouts. There type II matched the 
oracle all 10 times for the low geometric signal. They were slightly better 
than type I BCV 9 times out of 10, for the binary medium signal case. 

The ratio - - /if (or its logarithm) measures the 

regret we might have in choosing the rank k produced by a method. Figure 5 
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shows these quantities for 6 of the methods studied here. It is noteworthy 
that BCV with 500 x 500 holdouts attained the minimum in all 60 simulated 
cases. Smaller holdouts had difficulty with medium strength binary singular 
values and larger holdouts had difficulty with medium strength geometric 
singular values. 

In the binary high signal case the optimal rank, as measured by HX^*^^ — 
/xp, was /c°P* = 50 in all 10 realizations. The other 5 settings also yielded 
the same optimal rank for all 10 of their realizations. Those ranks and their 
estimates are summarized in Table 2. When small holdouts go wrong, they 
choose k too large and conversely for large holdouts, with a slight exception 
in the first row. 

The 200 X 200 holdouts were the only method to get close to the true rank 
50, in the medium signal binary case. Despite getting nearest the true rank, 
they had the worst squared errors of the six methods shown. Ironically the 
BIC methods are designed to estimate rank, but always chose fc = for this 
setting and did better for it. 

7.3. Real data examples. The truncated SVD has been recommended for 
analysis of microarray data by Alter, Brown and Botstein (2000). We have 
applied BCV to some microarray data reported in Rodwell et al. (2004). 



BCV relative errors versus rank 




20 40 60 80 100 20 40 50 80 100 

Truncated rank k Truncated rank k 

Fig. 4. This figure shows the BCV errors for the 1000 x 1000 examples with equal signal 
and noise magnitude, as described in the text. The left panel shows the results when there 
are 50 positive and equal singular values. The horizontal axis is fitted rank, ranging from 
to 100. The vertical axis depicts square errors, in each case relative to the squared error 
for rank k — 0. There are 10 nearly identical red curves showing the true error in 10 
realizations. The black curves show the naive error. Blue curves from dark to light show 
BCV holdouts of 200 x 200, 500 x 500 and 800 x 800. The right panel shows the results 
for a geometric pattern in the singular values. 
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Fig. 5. This figure shows squared errors attained by rank selection methods relative to 
an oracle that knows the optimal rank in each simulated 1000 x 1000 data set. For each 
method there are 60 relative errors corresponding to binary vs geometric singular values, 
high, medium and low signal strength, each replicated 10 times. The methods shown are 
BCV with hx h holdouts for h G {200, 500, 800} and the three BIC methods of Bat and Ng 
(2002). 



The data comprise a matrix of 133 microarrays on 44,928 probes, from a 
study of the effects on gene expression of aging in the human kidney. The 
133 arrays vary with respect to age and sex as well as the tissue type, some 
being from the kidney cortex while others are from the medulla. 

Figure 6 shows the results from bi-cross-validation of the SVD on this 
data using (2 x 2)-fold holdouts. The rank case has such a large error as 
to be uninteresting and so the errors are shown relative to the rank 1 case. 
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Table 2 



This table shows the average rank k chosen in 10 replications of the 1000 x 1000 matrix 
simulations described in the text. There is one row for each singular value pattern. The 
column "Opt" shows argmiiij. IIX'*' — The next three columns are for k chosen via 
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The naive MSE keeps dropping while the BCV version decreases quickly 
for small k before becoming flat. The minimum BCV error over 1 < < 60 
takes place at rank k = 41. 

We have seen similar very flat BCV curves in other microarray data sets. 
Microarray data sets tend to have quite extreme aspect ratios. Here the 
aspect ratio c = m/n is over 300. Simulations with Gaussian data, using 
binary and geometric singular values and more extreme aspect ratios, behave 
like the Gaussian data simulations reported earlier in this section, and not 
like microarray BCV results. 

It is plausible that the biological processes generating the microarray data 
sets contain a large number of features compared to the small number of ar- 




FlG. 6. This figure compares the naive squared error in black to the (2 x 2)-fold bi-cross- 
validation, shown in blue. The ranks investigated vary from 1 to 60. Both error curves are 
normalized by their value at rank k — 1. The BCV curve takes it's minimum at k = 41, 
which is marked with a reference point. 
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rays sampled. The singular vectors in the real microarray data examples 
tend to have high kurtosis, with values in the tens or hundreds. For the 
kidney data the 133 singular vectors of length 44,928 had an average kur- 
tosis of over 180. Such high kurtosis could arise from small intense clusters 
corresponding to features, or from heavy tailed noise. But heavy tailed noise 
values would not be predictive of each other in a holdout, and so we infer 
that multiple sources of structure are likely to be present. 

We also ran BCV to estimate the rank of a truncated SVD for some stan- 
dardized testing data given to us by Ed Haertel. The data set represented 
3778 students taking an eleventh grade test with 260 questions. The data are 
a matrix X G {0, i}3778x260^ where Xij = 1 if student i correctly answered 
question j, and is otherwise. A rank 1 approximation to this matrix cap- 
tures roughly 60% of the mean square. For this data set BCV chooses rank 
14. The NMF is more natural for such data and we discuss this example 
further in Section 8. 

8. NMF examples. We use the CLASSICS corpus from 
ftp://ftp.cs.cornell.edu/pub/smart/ to build an NMF model where 
we know what the true best rank should be. This corpus consists of 3893 
abstracts from journals in three different topic areas: aeronautical systems, 
informatics and medicine. The abstracts are typically 100-200 words, and 
there are roughly the same number of abstracts from each topic area. After 
removing stop words and performing Porter stemming [Porter (1980)], 4463 
unique words remain in the corpus. Ignoring all lexical structure, we can 
represent the entire corpus as a 3893 x 4463 matrix of word counts X™'^, 
whose ij elements is the number of times word j appears in document i. 
This matrix is very sparse, with only approximately 1% of the entries being 
nonzero. 

We construct a simulated data set X from X°^^^ as follows. First we 
split the original matrix into X'^"^ = X'^^^ + X'^^^ + X'^^^^, with one matrix 
per topic area. Then we fit a rank-1 NMF to each topic, = WiHi. 

We combine these models, yielding W = {Wi W2 VF3). The columns of W 
represent 3 topics. We take H S [0, cxd)^^'*^^^ by minimizing IIX™'^ — 
representing each document by a mixture of these topics. 

For i = 1, . . . , 3893 and j = 1, . . . , 4463, we sample Xij independently from 
Poi(^jj) where the matrix ^ equals e + WH. The scalar e is the average value 
of the entries in WH, and its presence adds noise. The signal WH has rank 
3, but in the Poisson context, additive noise raises the rank of /i = K{X) to 
4. 

The true error \\Xi^ — fj.\\p / (mn) between fi and the fc-term NMF approx- 
imation of X is shown in Figure 7. We see that it is minimized at A; = 3. 
BCV curves are also shown there. With (2 x 2)-fold and (3 x 3)-fold BCV, 
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BCV relative errors versus rank 
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Fig. 7. The red curves in the left panel show the true normalized squared error loss 
MSE(fc)/MSE(0) for the estimated NMFs of ranks through 10 fit to the synthetic CLAS- 
SICS data. The black curves show the naive estimates of loss. The blue curves, from dark 
to light, show estimates based on (5 x 5)-fold, (3 x 3)-fold and (2 x 2)-fold BCV. There 
were 10 independently generated instances of the problem. The right panel zooms in on the 
upper portion of the left panel. 



the minimizer is at k = 3. For (5 x 5)-fold BCV, the minimizer is at /c = 4. 
While this is the true rank, it attains a higher MSE by about 50%. Ten in- 
dependent repetitions of the process gave nearly identical results. The BIC 
methods are designed for a Gaussian setting, but it is interesting to apply 
them here anyway. All of them chose A; = 3. 

8.1. Educational testing data. Here we revisit the educational testing 
example mentioned for BCV of the SVD. We apphed BCV to the NMF 
model using a least squares criterion and both the plain residuals (5.2), as 
well as the ones from (5.3) where the held out predictions are constrained 
to be a product of nonnegative factors. 

Figure 8 shows the results. For each method we normalize the BCV 
squared error by the error for a rank model. Surprisingly, the plain 

SVD gives a better hold out error. This suggests that factors not constrained 
to be nonnegative predict this data better, despite having the possibility of 
going out of bounds. The comparison between the two NMF methods is 
more subtle. Both attempt to estimate the error made by NMF on the orig- 
inal data. The less constrained method shows worse performance when the 
rank is very high. While it is a less realistic imitator of NMF, that fact may 
make it less prone to over-fitting. 
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9. Discussion. We have generalized the hold-out-one method of Gabriel 
(2002) for cross- validating the rank of truncated SVD to r x s hold-outs and 
then to other outer product models. 

The problem of choosing the holdout size remains. Holdouts that are too 
small appear more prone to overfitting, while holdouts that are too large are 
more prone to underfitting, especially when the number of important outer 
product terms is not small compared to the dimensions of the held in data 
matrix. Until a better understanding of the optimal holdout is available, we 
recommend a (2 x 2)-fold or (3 x 3)-fold BCV, the latter coming close to 
the customary 10-fold CV in terms of the amount of work required. In our 
examples (2 x 2)-BCV gave the most accurate recovery of /i, except in the 
small SVD example. 

We find that BCV works similarly to CV. In particular, the criterion 
tends to decrease rapidly toward a minimum as model complexity increases 
and then increase slowly thereafter. As a model selection method, CV works 
similarly to Akaike's Information Criterion (AIC) [Akaike (1974)], and dif- 
ferently from the Bayesian Information criterion (BIC) of Schwarz (1978). 
The BIC is better at finding the true model, when it is a subset of those 
considered, while CV and AIC are better for picking a model to minimize 
risk. We expect that BCV will be like CV in this regard. See Shao (1997) 
for a survey of model selection methods for i.i.d. sampling. 

Both cross-validation and the bootstrap are sample reuse methods. When 
there is a symmetry in how rows and columns are sampled and interpreted, 
then one might choose to resample individual matrix elements or resample 
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Fig. 8. This figure plots relative mean squared error under BCV for ranks 1 to 50 on the 
educational testing data. The three curves are for truncated SVD, NMF with least squares 
residuals (5.2), and NMF with nonlinear least squares residuals (5.3). The minima, marked 
with solid points, are at I4, 11 and 17 respectively. 
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rows independently of columns. Some recent work by McCuUagh (2000) has 
shown that in crossed random effects settings, resampling matrix elements is 
seriously incorrect, while resampling rows and columns independently is ap- 
proximately correct. Owen (2007) generalizes those results to heteroscedastic 
random effects in a setting where many, even most, of the row column pairs 
are missing. 

In the present setting it does not seem "incorrect" to leave out single 
matrix elements, though it may lead to over-fitting. Indeed, the method 
of Gabriel (2002) by leaving out a 1 x 1 submatrix does exactly this. Ac- 
cordingly, a strategy of leaving out scattered subsets of data values, as, for 
example. Wold (1978) does, should also work. There are some advantages to 
leaving out an r by s submatrix though. Practically, it allows simpler algo- 
rithms to be used on the retained (m — r)x(n — s) submatrix. In particular, 
for the SVD there are algorithms guaranteed to find the global optimizer 
when blocks are held out. Theoretically, it allows one to interpret the hold 
out prediction errors as residuals via MacDuffee's theorem, and it allows 
random matrix theory to be applied. 

Acknowledgments. We thank Alexei Onatski for sharing early results of 
his random matrix theory, and we thank Susan Holmes, Olga Troyanskaya 
and Lek-Heng Lim for pointing us to some useful references. We benefited 
from many helpful comments by the reviewers at AO AS, who brought the 
papers by Bai and Ng (2002) and Minka (2000) to our attention, and sug- 
gested looking beyond simple type I residuals. Thanks to Ed Haertel for 
supplying the educational testing data. 

REFERENCES 

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Au- 
tomat. Control. 19 716-723. MR0423716 

Alter, O., Brown, P. O. and Botstein, D. (2000). Singular value decomposition for 
genome-wide expression data processing and modeling. Proc. Nat. Acad. Sci. U.S.A. 
97 10101-10106. 

Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71 
135-171. MR1956857 

Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor 
models. Econometrica 70 191-221. MR1926259 

Baik, J. and Silverstein, J. W. (2004). Eigenvalues of large sample covariance matrices 
of spiked population models. J. Multivariate Anal. 97 1382-1408. MR2279680 

Ben-Israel, A. and Greville, T. N. E. (2003). Generalized Inverses: Theory and Appli- 
cations, 2nd ed. Springer, New York. MR1987382 

Besse, p. and Ferre, L. (1993). Sur I'usage de la validation croisee en analyse en com- 
posantes principales. Revue de statistique appliquee 41 71-76. MR1253513 

DOS S. DiAS, C. T. and Krzanowski, W. J. (2003). Model selection and cross validation 
in additive main effect and multiplicative interaction models. Crop Science 43 865-873. 



32 



A. B. OWEN AND P. O. PERRY 



Eastment, H. T. and Krzanowski, W. J. (1982). Cross-validatory choice of the num- 
ber of components from a principal component analysis. Technometrics 24 73-77. 
MR0653115 

ECKART, C. and YouNG, G. (1936). The approximation of one matrix by another of lower 

rank. Psychometrika 1 211-218. 
Gabriel, K. (2002). Le biplot-outil d'exploration de donnees multidimensionelles. Journal 

de la Societe Francaise de Statistique 143 5-55. 
GOLUB, G. H. and Van Loan, C. F. (1996). Matrix Computations, 3rd ed. Johns Hopkins 

Univ. Press, Baltimore, MD. MR1417720 
Hansen, P. C. (1987). The truncated SVD as a method for regularization. BIT 27 534-553. 

MR0916729 

Hartigan, J. (1975). Clustering Algorithms. Wiley, New York. MR0405726 

HOFF, P. D. (2007). Model averaging and dimension selection for the singular value de- 
composition. J. Amer. Statist. Assoc. 102 674-685. MR2325118 

Holmes-Junca, S. (1985). Outils informatiques pour revaluation de la pertinence d'un 
resultat en analyse des donnees. Ph.D. thesis, Universite Montpelier 2. 

Jackson, D. A. (1993). Stopping rules in principal components analysis; A comparison of 
heuristical and statistical approaches. Eeology (Durham) 74 2204-2214. 

Johnstone, I. (2001). On the distribution of the largest eigenvalue in principal compo- 
nents analysis. Ann. Statist. 29 295-327. MR1863961 

JOLLIFFE, I. T. (2002). Prineipal Component Analysis, 2nd ed. Springer, New York. 
MR2036084 

JUVELA, M., Lehtinen, K. and Paatero, P. (1994). The use of positive matrix factor- 
ization in the analysis of molecular line spectra from the thumbprint nebula. Clouds, 
Cores, and Low Mass Stars 65 176-180. 

KOLDA, T. G. and O'Leary, D. P. (1998). A semidiscrete matrix decomposition for latent 
semantic indexing in information retrieval. ACM Transactions on Information Systems 
16 322-346. 

Laurberg, H., Christensen, M. G., Plumbley, M. D., Hansen, L. K. and Jensen, 
S. H. (2008). Theorems on positive data: On the uniqueness of NMF. Computational 
Intelligence and Neuroscience 2008. 

Lazzeroni, L. and Owen, A. (2002). Plaid models for gene expression data. Statist. Sinica 
24 61-86. MR1894189 

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix 

factorization. Nature 401 788-791. 
Louwerse, D. J., Smilde, A. K. and Kiers, H. A. L. (1999). Cross-vahdation of multiway 

component models. Journal of Chemometrics 13 491-510. 
Mardia, K. v., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic 

Press, London. MR0560319 
McCuLLAGH, P. (2000). Resamphng and exchangeable arrays. Bernoulli 6 285-301. 

MR1748722 

MiNKA, T. P. (2000). Automatic choice of dimensionality for PGA. In NIPS 2000 598-604. 
MuiRHEAD, R. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York. 
MR0652932 

Oba, S., Sato, M., Takemasa, L, Monden, M., Matsubara, K. and Ishii, S. (2003). 

A Bayesian missing value estimation method for gene expression profile analysis. Bioin- 

formatics 19 2088-2096. 
Onatski, a. (2007). Asymptotics of the principal components estimator of large factor 

models with weak factors and i.i.d. Gaussian noise. Technical report, Columbia Univ. 
Owen, A. B. (2007). The pigeonhole bootstrap. Ann. Appl. Statist. 1 386-411. MR2415741 



BI-CROSS- VALIDATION OF SVD AND NMF 



33 



Porter, M. (1980). An algorithm for suffix stripping. Program 14 130-137. 

RODWELL, G., SONU, R., Zahn, J. M., Lund, J., Wilhelmy, J., Wang, L., Xiao, W., 

MiNDRiNOS, M., Crane, E., Segal, E., Myers, B., Davis, R., Higgins, J., Owen, A. 

B. and Kim, S. K. (2004). A transcriptional profile of aging in the human kidney. PLOS 

Biology 2 2191-2201. 

SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. 
MR0468014 

Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica 7 221-264. 
MR1466682 

SOSHNIKOV, A. (2001). A note on universality of the distribution of the largest eigenvalues 

in certain sampling covariances. J. Statist. Phys. 108 5-6. MR1933444 
TiAN, Y. (2004). On mixed-type reverse-order laws for the Moore-Penrose inverse of a 

matrix product. Int. J. Math. Math. Sci. 2004 3103-3116. MR2110791 
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., 

BOTSTEIN, D. and Altman, R. B. (2001). Missing value estimation methods for DNA 

microarrays. Bioinformatics 17 520-525. 
Wold, H. (1966). Nonlinear estimation by iterative least squares procedures. In Research 

Papers in Statistics: Festschrift for J. Neyman (F. N. David, ed.) 411-444. Wiley, New 

York. MR0210250 

Wold, S. (1978). Cross- validatory estimation of the number of components in factor and 
principal components models. Technometrics 20 397-405. 

Department of Statistics 
Stanford University 
Stanford, California 94305 
USA 

E-MAIL: owen@stat.stanford.edu 
patperryOstanford.edu 



